Abstract. How can the information that a set $\{X_1, \ldots, X_n\}$ of random
variables contains about another random variable S be decomposed? To 
what extent do different subgroups provide the same, i.e. shared or redun- 
dant, information, carry unique information or interact for the emergence 
of synergistic information? 

Recently Williams and Beer proposed such a decomposition based on 
natural properties for shared information. While these properties fix the 
structure of the decomposition, they do not uniquely specify the values of 
the different terms. Therefore, we investigate additional properties such 
as strong symmetry and left monotonicity. We find that strong symme- 
try is incompatible with the properties proposed by Williams and Beer. 
Although left monotonicity is a very natural property for an information 
measure it is not fulfilled by any of the proposed measures. 
We also study a geometric framework for information decompositions and 
ask whether it is possible to represent shared information by a family of 
posterior distributions. 

Finally, we draw connections to the notions of shared knowledge and 
common knowledge in game theory. While many people believe that in- 
dependent variables cannot share information, we show that in game 
theory independent agents can have shared knowledge, but not common 
knowledge. We conclude that intuition and heuristic arguments do not 
suffice when arguing about information. 



1 Introduction 

The field of complex systems investigates systems which are composed of many 
components or sub-systems. Such a system is considered as complex if these 
components interact in intricate ways and exhibit dependencies at all scales. 
Informally, complex systems are often described in terms of information that is 
exchanged between components. Thus, information theory is a natural tool to 
study complex systems. 

As an example from neural coding, consider two neurons which provide infor- 
mation about some stimulus. Many scientists have tried to uncover whether both 
neurons provide redundant information about the stimulus or act synergetically, 



i.e. provide information which can only be recovered when the joint response 
of both cells is recorded simultaneously [1,2]. Similarly, one could ask for the
unique information of each response, i.e. information that can be obtained from 
one of the cells, but not the other. For example, the brain separates visual infor- 
mation into the where and what pathways [3] which potentially provide unique 
information with respect to each other. Another way to explain the intuition on 
how information can be decomposed, is to consider two agents which are inter- 
rogated about certain topics. For example, assume that one agent is an expert 
in physics and biology, whereas the other one has studied art and biology. In 
this case, both agents could answer questions about biology being their shared 
topic. Furthermore, each agent has additional unique information about physics 
and art, respectively. Considering their joint responses an interrogator might be 
able to draw interesting connections between art and physics none of the agents 
is aware of. This would correspond to the synergetic information in this case. 

In general, when considering more than two random variables, there may be 
different combinations of shared, unique and synergistic information, depending 
on how the information is distributed among the random variables. The total 
mutual information $I(S : X_1, \ldots, X_n)$ should then be a sum of different terms
with a well-defined interpretation. At the moment, it is not clear how many such 
terms are necessary in the general case of n interacting elements. Williams and 
Beer recently proposed one such decomposition, which they call partial informa- 
tion (PI) decomposition [4]. This decomposition is naturally derived from simple 
intuitive properties that such a decomposition should satisfy. 

Before explaining the construction of Williams and Beer, we first have a 
look at the case of n = 2 explanatory variables in Section 2. In Section 3 we
discuss natural properties that such a decomposition should satisfy and, following
Williams and Beer, use these properties to derive the PI decomposition. In
Section 4 we propose additional properties that relate the values of shared
information in situations where we ask for information about different variables.
In Section 5 we discuss the measure $I_{\min}$ proposed by Williams and Beer and
compare it to another function $I_I$, i.e. the minimum of the pairwise mutual
informations. We show that the function $I_{\min}$ may decrease when we ask for
information about a larger variable. In Section 6 we study the case of three
variables. We show that it is difficult to assign intuitively plausible values to all 
partial information terms, even in the simple XOR-example. Using this example 
we show that the structure of the PI lattice is incompatible with a symmetry 
property which we call strong symmetry. 

In Section 7 we propose a geometric picture for information decomposition.
This view provides an appealing mathematical structure and provides additional
insights into the structure of information. Within this geometric framework, we
compare our ideas to the measures proposed in [4] and [5]. Then, in Section 8 we
study the game theoretic notions of shared and common knowledge that are used 
to describe epistemic states of multi-agent systems, and we discuss how these 
notions are related to the problem of decomposing information. We conclude 
with an outlook on the possibility of a general decomposition of information. 



2 The Case of Two Variables 



First, we fix the notation and recall some basic definitions from information 
theory [6]. We assume that a system consists of components $X_1, \ldots, X_n$. For
simplicity we assume that the set of possible states $\mathcal{X}_i$ that a component $X_i$ can
be in is finite. Thus, the set of all possible states for the whole system is given
by $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$.

Given a probability distribution $p$ on $\mathcal{X}$, the $X_i$ become random variables.
Mutual information between two random variables X and Y quantifies the infor- 
mation about Y that is gained by knowing X and vice versa. It can be defined 
as 

$$I(X : Y) = \sum_{y \in \mathcal{Y}} p(y)\, D(p(X|y) \,\|\, p(X)) \qquad (1)$$

where $D(p(X|y) \,\|\, p(X)) = \sum_{x \in \mathcal{X}} p(x|y) \log \frac{p(x|y)}{p(x)}$ is the Kullback-Leibler (KL)
divergence between $p(X|y)$ and $p(X)$.^1 The KL divergence is often considered
as a distance between probability distributions even though it is not a metric.
But, like a metric, it vanishes if and only if the two distributions are identical.
It can also be interpreted as an information gain: if one finds out that $Y = y$
then $D(p(X|y) \,\|\, p(X))$ bits of information are gained about $X$. It is well known
that the mutual information is symmetric and vanishes if and only if X and Y 
are independent. 
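
As a small numerical illustration of definition (1), the following Python sketch computes the mutual information as an expected KL divergence. The dictionary-based representation of the joint distribution and the function names are our own assumptions for this illustration, not part of the original text.

```python
import math

def kl_divergence(q, p):
    """D(q || p) in bits; q and p are dicts mapping outcomes to probabilities."""
    return sum(qx * math.log2(qx / p[x]) for x, qx in q.items() if qx > 0)

def mutual_information(joint):
    """I(X:Y) = sum_y p(y) D(p(X|y) || p(X)), with joint[(x, y)] = p(x, y)."""
    p_x, p_y = {}, {}
    for (x, y), pxy in joint.items():
        p_x[x] = p_x.get(x, 0.0) + pxy
        p_y[y] = p_y.get(y, 0.0) + pxy
    total = 0.0
    for y, py in p_y.items():
        if py == 0:
            continue
        # Posterior distribution of X given the outcome y.
        posterior = {x: joint.get((x, y), 0.0) / py for x in p_x}
        total += py * kl_divergence(posterior, p_x)
    return total

# Two toy distributions (hypothetical examples): Y is a copy of X, or independent of X.
copy = {(0, 0): 0.5, (1, 1): 0.5}
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
print(mutual_information(copy))         # one bit
print(mutual_information(independent))  # zero bits
```

Exchanging the roles of X and Y in `joint` leaves the result unchanged, in line with the symmetry of the mutual information.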

Consider now three random variables $X_1$, $X_2$ and $S$. The (total) mutual
information $I(S : (X_1, X_2))$ quantifies the total information that is gained about $S$
if the outcome of $X_1$ and $X_2$ is known. How do $X_1$ and $X_2$ contribute to this
information? 

For two explanatory variables, we expect four contributions to $I(S : (X_1, X_2))$:

$$I(S : (X_1, X_2)) = SI(S : X_1; X_2) + UI(S : X_1 \setminus X_2) + UI(S : X_2 \setminus X_1) + CI(S : X_1; X_2). \qquad (2)$$

These are the shared (redundant) information $SI(S : X_1; X_2)$, the unique informations
$UI$ and the complementary (synergistic) information $CI(S : X_1; X_2)$. Intuition
tells us that the individual mutual informations that are provided by $X_1$ and $X_2$
should decompose as

$$\begin{aligned}
I(S : X_1) &= SI(S : X_1; X_2) + UI(S : X_1 \setminus X_2), \\
I(S : X_2) &= SI(S : X_1; X_2) + UI(S : X_2 \setminus X_1).
\end{aligned} \qquad (3)$$

Using the full decomposition (2) and the chain rule of mutual information [5] we
find that the conditional informations correspond to unique and complementary
information, e.g. $I(S : X_1 | X_2) = UI(S : X_1 \setminus X_2) + CI(S : X_1; X_2)$. Furthermore,
we recover the fact that the co-information $I_{Co}$ [7] combines shared and
complementary information, i.e.

$$I_{Co}(S : X_1 : X_2) := I(S : X_1 | X_2) - I(S : X_1) = CI(S : X_1; X_2) - SI(S : X_1; X_2). \qquad (4)$$

^1 Here, $p(X)$ denotes the probability distribution of the random variable $X$. When
referring to the probability of a particular outcome $x \in \mathcal{X}$ of this random variable,
we write $p(x)$.

Unfortunately, the three linear equations (2) and (3) do not completely specify
the four functions on the right hand side of (2). To determine the decomposition
(2) it is sufficient to define one of the functions $SI$, $UI$ and $CI$. It seems
to be a difficult task to come up with a reasonable and well-motivated definition
of $SI$ such that the induced definitions of $UI$ and $CI$ via equations (2) and (3)
are non-negative. The same is true when trying to find formulas for $UI$ or $CI$.
Note that any definition of the unique information fixes two of the terms in (2).
This leads to the consistency condition 

$$I(S : X_1) + UI(S : X_2 \setminus X_1) = I(S : X_2) + UI(S : X_1 \setminus X_2), \qquad (5)$$

which resembles the chain rule. Indeed, $UI(S : X_1 \setminus X_2)$ can be considered as
a version of conditional information which does not contain the complementary
information.^2

Apart from the problem of finding formulas for SI, UI and CI, a second 
problem is how to generalize the decomposition (2) to more than two explanatory
variables. A possible solution to both problems was recently proposed by
Williams and Beer. 
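
The bookkeeping behind equations (2), (3) and (5) can be made concrete with a short Python sketch. It assumes that the three mutual informations are already known and that a value for the shared information has been chosen by hand; the function `decompose` and the numerical inputs are hypothetical and only illustrate how one choice fixes the remaining three terms.

```python
def decompose(i_s_x1, i_s_x2, i_s_x1x2, shared):
    """Solve equations (2) and (3) for UI and CI once a value of SI is chosen."""
    ui_1 = i_s_x1 - shared                 # UI(S : X1 \ X2), from (3)
    ui_2 = i_s_x2 - shared                 # UI(S : X2 \ X1), from (3)
    ci = i_s_x1x2 - shared - ui_1 - ui_2   # CI(S : X1; X2), from (2)
    return {"SI": shared, "UI1": ui_1, "UI2": ui_2, "CI": ci}

# S = X1 xor X2 with independent uniform bits: I(S:X1) = I(S:X2) = 0, I(S:X1X2) = 1.
xor_terms = decompose(0.0, 0.0, 1.0, shared=0.0)

# X1 = X2 = S for a uniform bit S: all three mutual informations equal one bit.
copy_terms = decompose(1.0, 1.0, 1.0, shared=1.0)

# The same copy example with an ill-chosen SI of half a bit: CI becomes negative.
bad_terms = decompose(1.0, 1.0, 1.0, shared=0.5)

print(xor_terms)
print(copy_terms)
print(bad_terms)
```

Whatever value of `shared` is chosen, the consistency condition (5) holds by construction, since both of its sides equal SI + UI1 + UI2. The third call shows why non-negativity is the difficult part: an arbitrary choice of SI can force a negative complementary term.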



3 Natural Properties of Shared Information and the 
Partial Information Lattice 

Williams and Beer [4] base their construction of a non-negative decomposition
of $I(S : X_1 \ldots X_n)$ on the notion of redundancy or shared information. Let
$A_1, \ldots, A_k \subseteq \{X_1, \ldots, X_n\}$, and denote by $I_\cap(S : A_1; \ldots; A_k)$ the information
about $S$ that is shared among the random variables in the sets $A_1, \ldots, A_k$. It is
natural to demand that $I_\cap$ satisfy the following properties:



(GP) $I_\cap(S : A_1; \ldots; A_k) \geq 0$. (global positivity)

($S_0$) $I_\cap(S : A_1; \ldots; A_k)$ is symmetric in $A_1, \ldots, A_k$. (weak symmetry)

(I) $I_\cap(S : A) = I(S : A)$ equals the mutual information of $S$ and $A$. (self-redundancy)

(M) $I_\cap(S : A_1; \ldots; A_k) \leq I_\cap(S : A_1; \ldots; A_{k-1})$, with equality if $A_{k-1}$ is a subset
of $A_k$. (monotonicity)



^2 A related notion has been developed in the context of cryptography to quantify the
secret information. Although the secret information has a clear operational
interpretation it cannot be computed directly, but is upper bounded by the intrinsic mutual
information $I(S : X_1 \downarrow X_2)$ [8,9]. Unfortunately, the intrinsic mutual information
does not obey the consistency condition (5), and hence it cannot be interpreted as
unique information in our sense.



The properties ($S_0$), (I) and (M) have been proposed as axioms of shared
information by Williams and Beer in [4]. As Williams and Beer observe, (GP)
is a consequence of the other properties. Here we state it as a separate
property, since we want to discuss what happens if we drop or relax some of
these properties.

The properties ($S_0$) and (M) imply that it is sufficient to define the function
$I_\cap(S : A_1; \ldots; A_k)$ in the case that $A_i \not\subseteq A_j$ for $i \neq j$. A family of sets
$A_1, \ldots, A_k$ with this property is called an anti-chain. The anti-chains form a
lattice with respect to the partial order defined by $(B_1, \ldots, B_k) \preceq (A_1, \ldots, A_l)$
if and only if for each $i = 1, \ldots, l$ there exists $j \in \{1, \ldots, k\}$ such that $B_j \subseteq A_i$.
If $S$ is fixed, then ($S_0$) and (M) imply that $I_\cap(S : \cdot)$ is a monotone function on
the lattice of anti-chains of $\{X_1, \ldots, X_n\}$: If $(B_1, \ldots, B_k) \preceq (A_1, \ldots, A_l)$, then

$$I_\cap(S : B_1; \ldots; B_k) = I_\cap(S : B_1; \ldots; B_k; A_1; \ldots; A_l) \leq I_\cap(S : A_1; \ldots; A_l).$$

This lattice is also called the partial information (PI) lattice. In this paper, we
focus on the case of two or three random variables, and the corresponding lattices
are depicted in Figures 1 and 2.
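
To make the structure of the PI lattice tangible, the following Python sketch enumerates all anti-chains of non-empty subsets of $\{X_1, \ldots, X_n\}$ by brute force and implements the partial order defined above. The representation of an anti-chain as a frozenset of frozensets of indices is an assumption made for this illustration, and the exhaustive search is only feasible for very small n.

```python
from itertools import combinations

def nonempty_subsets(n):
    """All non-empty subsets of {1, ..., n} as frozensets of indices."""
    items = range(1, n + 1)
    return [frozenset(c) for r in range(1, n + 1) for c in combinations(items, r)]

def antichains(n):
    """All non-empty families of subsets in which no set contains another."""
    subsets = nonempty_subsets(n)
    result = []
    for r in range(1, len(subsets) + 1):
        for family in combinations(subsets, r):
            if all(not (a < b or b < a) for a, b in combinations(family, 2)):
                result.append(frozenset(family))
    return result

def precedes(beta, alpha):
    """(B_1, ..., B_k) precedes (A_1, ..., A_l): every A_i contains some B_j."""
    return all(any(b <= a for b in beta) for a in alpha)

print(len(antichains(2)), len(antichains(3)))  # 4 18
```

The counts 4 and 18 are the numbers of nodes of the PI lattices for two and three variables. With `precedes` one can check, for instance, that the anti-chain of all singletons lies below every other node and that the single set of all variables lies above every other node.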



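[Figure 1: a) lattice with top node $\{X_1, X_2\}$, middle nodes $\{X_1\}$ and $\{X_2\}$, and bottom node $\{X_1\}\{X_2\}$. b) the same lattice with values $H(X_1, X_2)$ at the top, $H(X_1)$ and $H(X_2)$ in the middle, and $I(X_1 : X_2)$ at the bottom.]

Fig. 1. The PI lattice for two random variables. a) The sets corresponding to the nodes
in the lattice. b) The redundancies at the nodes for $S = \{X_1, X_2\}$, assuming strong
symmetry (see ($S_1$) in Section 4).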



Properties (M) and (I) imply $I_\cap(S : A_1; \ldots; A_k) \leq I_\cap(S : A_1) = I(S : A_1) \leq I(S : X_1 \ldots X_n)$.
To obtain a decomposition of this total mutual information,
we need to associate to each element of the PI lattice a "local quantity"
$I_\partial$ in such a way that

$$I_\cap(S : A_1; \ldots; A_k) = \sum_{(B_1, \ldots, B_l) \preceq (A_1, \ldots, A_k)} I_\partial(S : B_1; \ldots; B_l).$$

One can show, using the notion of a Möbius inversion, that such a function $I_\partial$
always exists, and $I_\partial$ is uniquely determined from $I_\cap$.

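As an example consider again the case of two variables (Figure 1). When
$S$ is given, then the upper three terms in the lattice correspond to the mutual
informations $I(S : X_1)$, $I(S : X_2)$ and $I(S : X_1X_2)$. The lowest term, $I_\cap(S : X_1; X_2)$,
is the shared information $SI(S : X_1; X_2)$. The PI decomposition is

$$\begin{aligned}
I_\cap(S : \{X_1X_2\}) &= I_\partial(S : \{X_1\}\{X_2\}) + I_\partial(S : \{X_1\}) + I_\partial(S : \{X_2\}) + I_\partial(S : \{X_1X_2\}), \\
I_\cap(S : \{X_1\}) &= I_\partial(S : \{X_1\}\{X_2\}) + I_\partial(S : \{X_1\}), \\
I_\cap(S : \{X_2\}) &= I_\partial(S : \{X_1\}\{X_2\}) + I_\partial(S : \{X_2\}), \\
I_\cap(S : \{X_1\}\{X_2\}) &= I_\partial(S : \{X_1\}\{X_2\}).
\end{aligned}$$

A comparison with (2) and (3) shows that

$$\begin{aligned}
I_\partial(S : \{X_1X_2\}) &= CI(S : X_1; X_2), \\
I_\partial(S : \{X_1\}) &= UI(S : X_1 \setminus X_2), \\
I_\partial(S : \{X_2\}) &= UI(S : X_2 \setminus X_1), \\
I_\partial(S : \{X_1\}\{X_2\}) &= SI(S : X_1; X_2).
\end{aligned}$$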



As stated above, when $I_\cap$ is known, then $I_\partial$ can be computed uniquely using
a Möbius inversion. In general, $I_\partial$ may have negative values. In order to have a
natural interpretation of the PI decomposition, we need to require:

(LP) $I_\partial \geq 0$. (local positivity)

Local positivity can also be expressed as a condition on $I_\cap$.

4 Further Natural Properties of Shared Information

The properties presented in the preceding section were identified by Williams and
Beer and are naturally related to the notion of the PI lattice. Unfortunately, they
are not enough to specify the function $I_\cap$ uniquely. The properties are incomplete
for mainly two reasons: First, they do not tell us much about the left hand side
apart from the normalization condition (I). Second, they do not tell us enough
about what happens when we add another argument on the right.

In this section we propose natural properties that describe the role of the
left-hand side. Our first proposal is the following property:

($S_1$) $I_\cap(S : A_1; \ldots; A_k)$ is symmetric in $S, A_1, \ldots, A_k$. (strong symmetry)

In the following, we mostly consider the case that $S = \{X_1, \ldots, X_n\}$, and in
this case (M) and ($S_1$) together imply that $I_\cap(S : A_1; \ldots; A_k) = I_\cap(A_1 : A_2; \ldots; A_k)$,
and hence we may omit the first argument $S$.

Unfortunately, strong symmetry is not satisfied by many information theoretic
quantities that are used to quantify shared information or synergy, but
nevertheless we think that it is natural: If $I_\cap$ has just two arguments, then strong
symmetry does hold, since the mutual information is symmetric. In other words,
the amount of information that one random variable $X_1$ contains about another
variable $X_2$ is the same as the amount of information that $X_2$ carries about $X_1$.
It is natural to assume that an analogous statement should hold if $I_\cap$ has more
than two arguments. Note that the co-information $I_{Co}$ is symmetric in all its
arguments.

Under the strong symmetry assumption, if we consider two variables $X_1$ and
$X_2$ and set $S = \{X_1, X_2\}$, then all functions are fixed. The corresponding lattice
is depicted in Figure 1b). We will see later that, given the other properties,
strong symmetry contradicts the local positivity in the case of three random
variables $X_1$, $X_2$, $X_3$. The implications of this will be discussed later.

A weaker property restricting the dependence on the first argument is the 
following: 

(LM) $I_\cap(S : A_1; \ldots; A_k) \leq I_\cap(SS' : A_1; \ldots; A_k)$. (left monotonicity)

This property captures the intuition that if $A_1, \ldots, A_k$ share some information
about $S$, then at least the same amount of information is available to reduce the
uncertainty about the joint outcome of $S$ and $S'$. Left monotonicity follows, of
course, from monotonicity and strong symmetry.

Another property, which is independent from strong symmetry and which 
also implies (LM), is the following: 

(LC) $I_\cap(SS' : A_1; \ldots; A_k) = I_\cap(S : A_1; \ldots; A_k) + I_\cap(S' : A_1; \ldots; A_k | S)$
(left chain rule)

where $I_\cap(S' : A_1; \ldots; A_k | S)$ is given by $\sum_s p(s) I_\cap(S' : A_1; \ldots; A_k | s)$, i.e.
all distributions are conditioned on s and then the average is taken to obtain a 
conditional information. This property is a natural generalization of the chain 
rule of mutual information. Moreover, a similar property is used in Shannon's 
axiomatic characterization of entropy. 

Unfortunately, the left chain rule is not fulfilled by any of the proposed
measures for shared information that we discuss later. Nevertheless, we state
it here, since we find it mathematically appealing. The same is true for left
monotonicity: Most measures do not satisfy (LM), see Section 5.

The left chain rule together with local positivity also implies the following
property which has recently been proposed by [5]:

(Id) $I_\cap(A_1 \cup A_2 : A_1; A_2) = I(A_1 : A_2)$. (identity)

The identity property implies that $I_\cap(\{X_1, X_2\} : X_1; X_2)$ vanishes if $X_1$ and
$X_2$ are independent. At first sight it seems natural that independent random
variables cannot share information. However, in Section 8 we will argue that
they may indeed share information in this case.



5 The Functions $I_{\min}$ and $I_I$



Williams and Beer define a function $I_{\min}(S : A_1; \ldots; A_k)$ which satisfies all their
properties (GP), ($S_0$), (I) and (M) as follows:

$$\begin{aligned}
I_{\min}(S : A_1; \ldots; A_k) &= \sum_s p(s) \min_i \sum_{a_i} p(a_i|s) \log \frac{p(s|a_i)}{p(s)} \\
&= \sum_s p(s) \min_i \sum_{a_i} p(a_i|s) \log \frac{p(a_i|s)}{p(a_i)} \\
&= \sum_s \min_i \sum_{a_i} p(a_i, s) \log \frac{p(a_i, s)}{p(a_i) p(s)}.
\end{aligned}$$

The idea is the following: For each $i$ compare the prediction $p(s|a_i)$ of $S$ by $A_i$
with the prior distribution $p(s)$ of $S$. Then combine a minimization over $i$ with
a suitable average using the joint distribution of $A_i$ and $S$.

The order of the minimization and the averaging plays a crucial role. If we 
interchange it, we obtain another function 

$$I_I(S : A_1; \ldots; A_k) := \min_i \sum_s p(s) \sum_{a_i} p(a_i|s) \log \frac{p(s|a_i)}{p(s)} = \min_i \{I(S : A_i)\}.$$

This function $I_I$ satisfies the same properties, including local positivity (LP)
(the proof of [4] that proves (LP) for $I_{\min}$ applies). Of course, $I_I$ does not at
all capture the intuition behind the notion of shared information; $I_I$ just compares
absolute values of mutual informations, without caring whether different
variables contain "the same information." We will later argue that $I_{\min}$ suffers
from a similar flaw (in particular, $I_{\min} = I_I$ in the examples considered below).
Note that any function $I_\cap$ satisfying the properties (GP), ($S_0$), (I) and (M)
satisfies $I_\cap \leq I_I$. In particular, $I_{\min} \leq I_I$.
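
The difference between minimizing before and after averaging can be checked numerically. The Python sketch below implements both functions for a joint distribution given as a dictionary from outcome tuples to probabilities; the data layout, the index-tuple arguments and the two toy distributions are assumptions of this illustration rather than part of the original definitions.

```python
import math
from collections import defaultdict

def marginal(joint, idx):
    """Marginal distribution of the coordinates listed in idx."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out

def specific_information(joint, s_idx, a_idx):
    """For each s: sum_a p(a|s) log2( p(s|a) / p(s) )."""
    p_s, p_a = marginal(joint, s_idx), marginal(joint, a_idx)
    p_sa = marginal(joint, s_idx + a_idx)
    info = defaultdict(float)
    for sa, p in p_sa.items():
        if p > 0:
            s, a = sa[:len(s_idx)], sa[len(s_idx):]
            info[s] += (p / p_s[s]) * math.log2(p / (p_s[s] * p_a[a]))
    return info

def i_min(joint, s_idx, sources):
    """Average over s of the minimum over sources (minimize, then average)."""
    infos = [specific_information(joint, s_idx, a) for a in sources]
    p_s = marginal(joint, s_idx)
    return sum(p * min(info[s] for info in infos) for s, p in p_s.items() if p > 0)

def i_I(joint, s_idx, sources):
    """Minimum over sources of the mutual information (average, then minimize)."""
    p_s = marginal(joint, s_idx)
    return min(sum(p_s[s] * v for s, v in specific_information(joint, s_idx, a).items())
               for a in sources)

# Toy example (hypothetical): S uniform on {0, 1, 2}, X1 = [S == 0], X2 = [S == 1].
three_state = {(0, 1, 0): 1 / 3, (1, 0, 1): 1 / 3, (2, 0, 0): 1 / 3}
# XOR with S = (X1, X2, X3): outcomes are (x1, x2, x3) and S uses all coordinates.
xor = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}

print(i_min(three_state, (0,), [(1,), (2,)]), i_I(three_state, (0,), [(1,), (2,)]))
print(i_min(xor, (0, 1, 2), [(0,), (1,), (2,)]), i_I(xor, (0, 1, 2), [(0,), (1,), (2,)]))
```

In the first example the two sources are informative about different outcomes of S, so the value of `i_min` is strictly smaller than that of `i_I`. In the XOR system with S equal to the whole system both functions give one bit.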

The function $I_I$ satisfies left monotonicity. However, $I_{\min}$ does not: For example,
one can construct a joint probability distribution of $S$, $S'$, $X_1$ and $X_2$ that
satisfies $I_{\min}(S : X_1; X_2) = \frac{1}{3} + \frac{2}{3}\left(\frac{3}{4} \log_2 3 - 1\right) > I_{\min}(SS' : X_1; X_2) = \frac{1}{3}$. This
example can be understood as follows: If $S = 0$, then both $X_1$ and $X_2$ have some
information about $S$ and thus contribute $\frac{3}{4} \log_2 3 - 1$ bits to $I_{\min}$ in this case.
However, if we additionally condition on $S'$, then in any case one of $X_1$ or $X_2$
carries no information: To be precise, if $(S, S') = (0, 0)$, then $X_2$ is uniformly
distributed, and if $(S, S') = (0, 1)$, then $X_1$ is uniformly distributed. Thus, in
both cases the minimization contributes zero bits to $I_{\min}$. The remaining case
$(S, S') = (1, 1)$ is equivalent to the case $S = 1$, where both $X_1$ and $X_2$ are fixed,
and contributes one bit with weight $\frac{1}{3}$.
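
The violation of left monotonicity can be reproduced numerically. The distribution in the Python sketch below is an assumption: it was chosen here so that it matches the verbal description above (uniform $X_2$ given $(S, S') = (0, 0)$, uniform $X_1$ given $(S, S') = (0, 1)$, both variables fixed given $S = 1$) and is not claimed to be the table of the original example. The helper functions are likewise only a sketch of $I_{\min}$.

```python
import math
from collections import defaultdict

def marginal(joint, idx):
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in idx)] += p
    return out

def specific_information(joint, s_idx, a_idx):
    """For each s: sum_a p(a|s) log2( p(s|a) / p(s) )."""
    p_s, p_a = marginal(joint, s_idx), marginal(joint, a_idx)
    p_sa = marginal(joint, s_idx + a_idx)
    info = defaultdict(float)
    for sa, p in p_sa.items():
        if p > 0:
            s, a = sa[:len(s_idx)], sa[len(s_idx):]
            info[s] += (p / p_s[s]) * math.log2(p / (p_s[s] * p_a[a]))
    return info

def i_min(joint, s_idx, sources):
    infos = [specific_information(joint, s_idx, a) for a in sources]
    p_s = marginal(joint, s_idx)
    return sum(p * min(info[s] for info in infos) for s, p in p_s.items() if p > 0)

# Assumed distribution over outcomes (s, s', x1, x2), consistent with the text.
joint = {
    (0, 0, 0, 0): 1 / 6,
    (0, 0, 0, 1): 1 / 6,
    (0, 1, 0, 0): 1 / 6,
    (0, 1, 1, 0): 1 / 6,
    (1, 1, 1, 1): 1 / 3,
}

small_target = i_min(joint, (0,), [(2,), (3,)])     # I_min(S  : X1; X2)
large_target = i_min(joint, (0, 1), [(2,), (3,)])   # I_min(SS': X1; X2)
print(small_target, large_target)
```

For this assumed distribution the first value agrees with $\frac{1}{3} + \frac{2}{3}(\frac{3}{4}\log_2 3 - 1)$ and the second with $\frac{1}{3}$, so enlarging the target from $S$ to $(S, S')$ decreases $I_{\min}$.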

Omitting the calculations we mention that the redundancy measure proposed
by [5] (and denoted by $I_{HSP}$ in Section 7) also violates left monotonicity in the
same example.

6 The Case of Three Variables 



[Figure 2: a) the PI lattice for three variables with top node 123. b) the same lattice with the redundancies at the nodes, with top node $H(123)$.]

Fig. 2. The PI lattice for n = 3. For simplicity the sets are abbreviated by
juxtaposing the indices of the corresponding variables. For example, 12|13 corresponds to
$\{X_1, X_2\}\{X_1, X_3\}$. a) The PI lattice. b) The redundancies at the nodes, assuming
strong symmetry and $S = \{X_1, X_2, X_3\}$.

For three variables, the PI lattice is depicted in Figure 2a). Under the
assumption of strong symmetry all but two values in this lattice are fixed, see
Figure 2b). The unknown values correspond to the information shared by three
random variables. 

In the following, we discuss an example with three random variables $X_1$, $X_2$,
$X_3$: Assume that $X_1$ and $X_2$ are independent binary random variables, and let
$X_3 = X_1 \oplus X_2$, where $\oplus$ denotes the sum modulo 2 or the XOR-function. Note
that this example is symmetric in $X_1$, $X_2$ and $X_3$. Figure 3a) shows the values of
$I_{\min}$ and $I_\partial$ in this example for $S = \{X_1, X_2, X_3\}$; in other words, we decompose
the information that the system has about itself. What is striking is that the
lowest entry in this lattice does not vanish: According to $I_{\min}$, $X_1$, $X_2$ and $X_3$
share one bit of information, although they are pairwise independent. This fact
that independent variables may share information according to $I_{\min}$ has also been
observed and criticized in [5]. We will later give an argument from game theory
that explains how independent variables can share information. Nevertheless,
in our opinion one bit of shared information is too much in this situation: The
absolute value of one bit of shared information needs to be compared to the fact
that each of $X_1$, $X_2$, $X_3$ does not carry more than one bit of information. Note
that in the XOR-example $I_{\min} = I_I$.

[Figure 3: two copies of the PI lattice for n = 3. a) Node values of $I_{\min}(123, \cdot)$ with $I_\partial(123, \cdot)$ in parentheses, ranging from 2(0) at the top node to 1(1) at the bottom node; the node 12|13|23 carries 2(1). b) Node values under strong symmetry: $H(123) = 2$; $H(12) = H(13) = H(23) = 2$; $I(12 : 13) = I(12 : 23) = I(13 : 23) = 2$; $H(1) = H(2) = H(3) = 1$; $I(1 : 23) = I(2 : 13) = I(3 : 12) = 1$; $I(1 : 2) = I(1 : 3) = I(2 : 3) = 0$.]

Fig. 3. Redundancies in the XOR-example: a) $I_{\min}(123, \cdot)$ in the example. The numbers
in parentheses are $I_\partial(123, \cdot)$. b) The shared information assuming strong symmetry.
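
The values of $I_{\min}$ and $I_\partial$ in the XOR-example can be recomputed with a brute-force Möbius inversion over the 18 nodes of the PI lattice. The following Python sketch assumes uniform inputs $X_1$, $X_2$ and uses zero-based indices internally; the helper names are hypothetical, and the shortcut for the specific information relies on $S$ being the whole system.

```python
import math
from itertools import combinations

# XOR system: X3 = X1 xor X2 with X1, X2 independent and uniform (assumed);
# the target S is the whole system (X1, X2, X3).
joint = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}

def marginal_prob(outcome, idx):
    """Probability that the variables in idx take the values they have in outcome."""
    return sum(p for o, p in joint.items() if all(o[i] == outcome[i] for i in idx))

def i_min(alpha):
    # Because S is the whole system, p(s|a) / p(s) = 1 / p(a), so the specific
    # information of a source A about the outcome s equals -log2 p(a).
    return sum(p * min(-math.log2(marginal_prob(s, sorted(a))) for a in alpha)
               for s, p in joint.items())

def antichains(n):
    subsets = [frozenset(c) for r in range(1, n + 1)
               for c in combinations(range(n), r)]
    return [frozenset(f) for r in range(1, len(subsets) + 1)
            for f in combinations(subsets, r)
            if all(not (a < b or b < a) for a, b in combinations(f, 2))]

def precedes(beta, alpha):
    """Partial order of the PI lattice: every set of alpha contains a set of beta."""
    return all(any(b <= a for b in beta) for a in alpha)

nodes = antichains(3)
below = {a: [b for b in nodes if b != a and precedes(b, a)] for a in nodes}
i_cap = {a: i_min(a) for a in nodes}

# Moebius inversion, processing the nodes from the bottom of the lattice upwards.
i_partial = {}
for a in sorted(nodes, key=lambda a: len(below[a])):
    i_partial[a] = i_cap[a] - sum(i_partial[b] for b in below[a])

def label(alpha):
    return "|".join(sorted("".join(str(i + 1) for i in sorted(a)) for a in alpha))

for a in sorted(nodes, key=lambda a: len(below[a])):
    print(f"{label(a):10s} I_min = {i_cap[a]:.0f}   I_partial = {i_partial[a]:.0f}")
```

Under these assumptions only two local terms are non-zero: the bottom node 1|2|3 and the node 12|13|23 each carry one bit, and together they account for the two bits of $H(X_1X_2X_3)$.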

A close analysis of this example also reveals that strong symmetry is incompatible
with the PI lattice:

Theorem 1. There is no measure of shared information that satisfies ($S_1$),
(M), (I) and (LP).

Proof. Assume that $I_\cap$ is a monotone function on the PI lattice that satisfies
strong symmetry ($S_1$). In the PI lattice for the XOR-example we can express
all values on the lattice in terms of entropies and mutual informations, with
one exception, see Figure 3b). Note that, by strong symmetry,
$I_\cap(X_1X_2X_3 : A_1; \ldots; A_k) = I_\cap(A_1; \ldots; A_k)$ whenever $A_1 \cup \cdots \cup A_k \subseteq \{X_1, X_2, X_3\}$.
Comparing with Figure 3b) we see that the information shared by $X_1$, $X_2$ and $X_3$
must vanish by monotonicity, since the terms on the next layer also vanish,
$I(X_i : X_j) = 0$ for $i \neq j$. Only the information shared by the pairs $\{X_1, X_2\}$,
$\{X_1, X_3\}$ and $\{X_2, X_3\}$ is not determined. However, we can bound these terms
by the monotonicity. Similarly, we can compute bounds on $I_\partial$. Namely,

$$\begin{aligned}
I_\partial(\{X_1, X_2\}; \{X_1, X_3\}; \{X_2, X_3\}) &= I_\cap(\{X_1, X_2\}; \{X_1, X_3\}; \{X_2, X_3\}) \\
&\quad - I_\cap(\{X_1\}; \{X_2, X_3\}) - I_\cap(\{X_2\}; \{X_1, X_3\}) - I_\cap(\{X_3\}; \{X_1, X_2\}) \pm 0 \\
&\leq 2 - 3 = -1,
\end{aligned}$$

where $\pm 0$ represents a sum of terms belonging to the lowest two layers of the PI
diagram, and these terms all vanish. This calculation shows that local positivity
is not possible. □
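
The numbers entering the proof can be checked directly. The Python sketch below computes the relevant entropies and mutual informations of the XOR-example, assuming uniform inputs; the function names are hypothetical, and the final line merely reproduces the arithmetic of the bound, using $H(X_1X_2X_3)$ as the upper bound given by monotonicity.

```python
import math
from collections import defaultdict

# XOR system with independent uniform inputs (assumed): outcomes (x1, x2, x3).
joint = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}

def entropy(idx):
    """Shannon entropy in bits of the variables with the given indices."""
    marg = defaultdict(float)
    for outcome, p in joint.items():
        marg[tuple(outcome[i] for i in idx)] += p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

def mutual_information(a, b):
    """I(A : B) = H(A) + H(B) - H(A, B) for two groups of variable indices."""
    return entropy(a) + entropy(b) - entropy(tuple(sorted(set(a) | set(b))))

pairwise = [mutual_information((i,), (j,)) for i, j in ((0, 1), (0, 2), (1, 2))]
one_vs_two = [mutual_information((0,), (1, 2)),
              mutual_information((1,), (0, 2)),
              mutual_information((2,), (0, 1))]
total = entropy((0, 1, 2))

# Monotonicity bounds the value at the node 12|13|23 by H(X1 X2 X3), so the
# local term at that node is at most total - sum(one_vs_two).
bound = total - sum(one_vs_two)
print(pairwise, one_vs_two, total, bound)
```

The pairwise mutual informations vanish, each variable shares one bit with the remaining pair, and the bound evaluates to $2 - 3 = -1$.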

To resolve this problem, one of the properties mentioned in Theorem 1 has to
be dropped. The easiest solution is to drop strong symmetry. What are the alter- 
natives? We have to keep self-redundancy (I) and local positivity (LP), since we 
want to find a decomposition of mutual-information into positive terms. There- 
fore, if we want to keep strong symmetry, we need to replace monotonicity (M). 
It is probably a good idea to keep the inequality condition in (M), but it is 
conceivable to replace the equality condition. However, one must keep in mind 
that the equality condition is essential in justifying the use of the PI lattice: 
Without this condition the values of the function $I_\cap$ on arbitrary collections of
subsets are not determined by its values on the antichains, and so the PI lattice is
no longer the natural domain of shared information. Therefore, without the
equality condition in (M) we need to compute many more terms to completely
specify $I_\cap$. In turn, this means that there are many more local terms $I_\partial$. With
these additional terms it may be possible to obtain local positivity and strong
symmetry at the same time.

Heuristically, what happens in the XOR-example is the following: The term
$I(X_1 : X_2X_3)$ on the third layer in Figure 2 (counted from below) is equal to
one bit, since we can compute $X_1$ from $X_2$ and $X_3$, and hence $I(X_1 : X_2X_3) = H(X_1)$.
Intuitively, the information shared between $X_1$ and $\{X_2, X_3\}$ is precisely
the information contained in $X_1$. However, the three terms $I(X_1 : X_2X_3)$,
$I(X_2 : X_1X_3)$ and $I(X_3 : X_1X_2)$ on the third layer are not independent, since $X_1$, $X_2$
and $X_3$ are not completely independent, but only pairwise independent. Hence,
if we compute the information shared by all three pairs, we cannot just add up
these three bits: We have to subtract (at least) one bit, which we overcounted.
Somehow this one bit that we overcounted does not have a place in the PI lattice.

If we drop strong symmetry and keep the PI lattice, the question remains how
to distribute the information over the PI lattice in the XOR-example. In any case,
monotonicity implies that $I_\cap(S : X_1X_2; X_2X_3) \leq I(S : X_1X_2X_3) = 2$.
On the other hand, the other three values on the third layer, the three mutual
informations $I(S : X_i)$, are all equal to one bit. These values restrict the possible
values of $I_\partial$, and it is not easy to motivate a non-negative assignment on intuitive
grounds, even for this simple example.

7 A Geometric Picture of Shared Information 

One problem that makes it difficult to define shared information is that there 
is no known experimental way to extract shared information. In this section 
we want to assume that shared information can be extracted or modelled con- 
cretely. We not only search for a number that measures the amount of shared 
information, but we want to represent the information itself. 

As a motivation consider the case of two random variables X, Y from the 
perspective of coding theory. Suppose that we want to transmit information 
about X and Y over some channel. Then the capacity that we need must exceed 
the amount of information that we want to transmit. To transmit a single
variable $X$, we need a capacity of $H(X)$. To be precise, this statement only becomes
true asymptotically: When we want to transmit a string of $n$ values of $n$
independent copies of $X$, then, for large $n$, if we have a channel with a capacity of
$H(X)$ per time unit $\Delta T$, then the time needed to transmit the string is roughly $n \Delta T$.
In the same sense, to transmit $X$ and $Y$ together, we need a channel of
capacity $H(\{X, Y\})$. Suppose that $X$ was already transmitted, i.e. both sender
and receiver know the value of $X$. As Shannon showed, in this case a channel
of capacity $H(Y|X) = H(\{Y, X\}) - H(X)$ is sufficient to transmit the remaining
information, such that the receiver knows both $X$ and $Y$. Hence, $H(Y|X)$
has the natural interpretation of unique information of $Y$ with respect to $X$,
and as Shannon's theorem shows, the unique information can be isolated and 



transmitted separately. The question is: Which other parts of information can 
be isolated? 

As before, we consider information about a random variable S. We follow the 
paradigm that our information or belief about S can be encoded in a probability 
distribution p{S). Suppose that X is another random variable. If S is not inde- 
pendent of X, then a measurement of X gives us further information about S. 
For example, if we know that X = x, then our belief about S can be encoded 
in the conditional probability distribution $p(S|x)$. Thus, the information that $X$
carries about $S$ can be encoded in a family $\{p(S|x)\}_{x \in \mathcal{X}}$ of probability
distributions for $S$. These distributions encode the posterior beliefs about $S$ conditioned
on each outcome of X. 

As motivated by Shannon, information can be quantified by logarithms of 
probabilities: The information that the state of the variable S is equal to the 
specific value $s$ is worth $-\log_2(p(S = s))$. Our uncertainty about $S$, when our
knowledge is encoded in the distribution $p(S)$, is then equal to the expected
information gain when we learn the value of $S$:

$$\sum_s p(s) \left(-\log(p(s))\right) =: H(S).$$

Similarly, the information that we gain when we learn that $X = x$ is equal to
the conditional entropy $H(S|X = x) = -\sum_s p(s|x) \log(p(s|x))$. The (expected)
information that $X$ brings us about $S$ is obtained by averaging $H(S|X = x)$ and
comparing the value with $H(S)$; this agrees with the mutual information:

$$\sum_x p(x) \sum_s p(s|x) \log(p(s|x)) - \sum_s p(s) \log(p(s)) = \sum_x p(x) \sum_s p(s|x) \log \frac{p(s|x)}{p(s)} = I(S : X).$$

The situation can be pictured geometrically. Let $\mathcal{P}_S$ be the set of all
probability distributions for $S$. Geometrically,

$$\mathcal{P}_S = \Big\{ p : \mathcal{S} \to \mathbb{R} \;:\; p(s) \geq 0, \; \sum_s p(s) = 1 \Big\}$$

is a simplex. The family $\{p(S|x)\}_x$ is a point configuration in $\mathcal{P}_S$, indexed by the
outcomes $x$ of the random variable $X$. The information gain is then the mean
reduction of uncertainty (in the sense of Shannon information) when replacing
the prior $p(S)$ with the family $\{p(S|x)\}_x$.

According to our geometric interpretation of information, the shared
information that $X_1, \ldots, X_k$ carry about $S$ should also be representable as a weighted
family of probability distributions for $S$. The question is how to construct this
weighted family from the posteriors $\{p(S|x_i)\}_{x_i}$ and the joint distribution of
$X_1, \ldots, X_k$ and $S$. Suppose that we have found such a family representing the
shared information, and denote it by $\{p_{x_1|x_2|\ldots|x_k}(S)\}_{x_1, \ldots, x_k}$. Then we want to
quantify the shared information. There are two natural possibilities:

$$SI_{lr}(S : X_1; \ldots; X_k) := \sum_{x_1, \ldots, x_k} \sum_s p(s, x_1, \ldots, x_k) \log \frac{p_{x_1|x_2|\ldots|x_k}(s)}{p(s)},$$

$$SI_{KL}(S : X_1; \ldots; X_k) := \sum_{x_1, \ldots, x_k} p(x_1, \ldots, x_k)\, D(p_{x_1|x_2|\ldots|x_k} \,\|\, p).$$

The function $SI_{KL}$ has the advantage that it always satisfies global positivity,
regardless of how we construct $p_{x_1|x_2|\ldots|x_k}$. By contrast, the function $SI_{lr}$ directly
measures the change of surprise when we replace the prior distribution $p(s)$ with
the distribution $p_{x_1|x_2|\ldots|x_k}$. Depending on how we construct $p_{x_1|x_2|\ldots|x_k}$ the value
of $SI_{lr}$ may become negative.

We would like to have the following properties: 

1. The construction should be symmetric in $x_1, \ldots, x_k$.

2. If $k = 1$, then we obtain the posterior: $p_{x_1}(S) = p(S|x_1)$.

3. More variables share less information:

$$D(p_{x_1|\ldots|x_k}(S) \,\|\, p(S)) \leq D(p_{x_1|\ldots|x_{k-1}}(S) \,\|\, p(S)).$$

These properties are related to the properties ($S_0$), (I) and (M) as stated above,
but re-formulated to hold pointwise for each joint outcome $x_1, \ldots, x_k$.
A natural candidate satisfying the above properties is given by 

$$p_{x_1|\ldots|x_k}(S) := \arg\min \Big\{ D\Big(\sum_{i=1}^{k} \lambda_i\, p(S|x_i) \,\Big\|\, p(S)\Big) \;:\; \lambda_i \geq 0, \; \sum_i \lambda_i = 1 \Big\}.$$

Since the KL divergence is strictly convex in its first argument, the function
$q \mapsto D(q \,\|\, p(S))$ has a unique minimum on any closed convex set. This shows that
the above definition is well-defined. Moreover, the definition ensures that
$p_{x_1|\ldots|x_k}(S)$ belongs to the convex hull of the posteriors $p(S|x_i)$ for $i = 1, \ldots, k$.
This models the fact that $p_{x_1|\ldots|x_k}(S)$ only involves information that is present in
these posteriors. In fact, $p_{x_1|\ldots|x_k}$ is the least informative distribution from this convex set.

The construction of $p_{x_1|\ldots|x_k}(S)$ implies the following property, which gives
an idea in which sense $p_{x_1|\ldots|x_k}(S)$ summarizes information shared among all the
posteriors $p(S|x_i)$:

Lemma 1. If all $p(S|x_i)$ satisfy some linear inequality, then $p_{x_1|\ldots|x_k}(S)$ satisfies
the same inequality. In particular:

1. If $p(s_1|x_i) \leq p(s_2|x_i)$ for all $i$, then $p_{x_1|\ldots|x_k}(s_1) \leq p_{x_1|\ldots|x_k}(s_2)$.

2. If $p(s|x_i) = 0$ for all $i$, then $p_{x_1|\ldots|x_k}(s) = 0$.

Unfortunately, $SI_{lr}$ violates monotonicity, and with $SI_{KL}$ the synergy can
become negative. Both facts can be illustrated with the same example.
From the equations of the bivariate decomposition we find that

$$CI(S : X_1; X_2) = I(S : X_1|X_2) - I(S : X_1) + SI(S : X_1; X_2),$$

and thus the non-negativity of $CI$ requires that

$$SI(S : X_1; X_2) \ge I(S : X_1) - I(S : X_1|X_2) = I_{co}(S : X_1 : X_2). \qquad (6)$$

Now, if $S$ is a function of $X_2$, then $I(S : X_1|X_2)$ vanishes, and therefore (6)
implies $SI(S : X_1; X_2) \ge I(S : X_1)$. Together with (M) we obtain

$$SI(S : X_1; X_2) = I(S : X_1), \quad \text{if $S$ is a function of $X_2$}. \qquad (7)$$

Consider the following distribution:

    S   X_1   X_2   p(s, x_1, x_2)
    0    0     0        2/6
    0    1     0        1/6
    1    0     1        1/6
    1    1     1        2/6



The relative location of $p(S)$ and the posteriors of $S$ given one or two of $X_1$ and
$X_2$ is visualized in Figure 4. Under this distribution $S$ and $X_1$ are positively
correlated, while $S = X_2$, and thus $I(S : X_1|X_2) = 0$. Consider the case
$X_1 = x_1 \neq x_2 = X_2$, in which $X_1$ and $X_2$ have conflicting posteriors about $S$,
i.e. $p(S|x_2)$ assigns probability one to $S = x_2$, whereas $p(S|x_1)$ assigns a higher
probability to $S = x_1 \neq x_2$. The convex hull of the two posteriors then contains
the prior, and thus $p_{x_1|x_2}(S)$ is equal to the prior $p(S)$ in this case. On the
other hand, if $X_1 = X_2 = x$, then both posteriors favor $S = x$. The convex hull
of $p(S|x_1)$ and $p(S|x_2)$ is an interval, and the posterior $p(S|x_1)$ is the closest
point to the prior $p(S)$. Therefore, $p_{x_1|x_2}(S) = p(S|x_1)$. In total,

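$$\begin{aligned}
I(S : X_1) - SI_{KL}(S : X_1; X_2)
&= \sum_{x_1, x_2} p(x_1, x_2) \big( D_{KL}(p(S|x_1) \| p(S)) - D_{KL}(p_{x_1|x_2}(S) \| p(S)) \big) \\
&= \sum_{x_1 \neq x_2} p(x_1, x_2)\, D_{KL}(p(S|x_1) \| p(S)) > 0,
\end{aligned}$$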

and therefore (7) is violated. One can check that in this case $SI_{lr}(S : X_1; X_2)$ also
violates (7), but in the other direction. Therefore, $SI_{lr}$ violates monotonicity.
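The two violations can be checked numerically. The sketch below is an illustration under stated assumptions rather than the authors' code: it takes the joint distribution as $p(0,0,0) = p(1,1,1) = 2/6$ and $p(0,1,0) = p(1,0,1) = 1/6$ for outcomes $(s, x_1, x_2)$, uses hypothetical helper names, and exploits that for binary $S$ the convex hull is an interval, so the KL-closest point to the prior is the prior clamped to that interval.

```python
import math
from collections import defaultdict

# Joint distribution p(s, x1, x2) of the example (S = X2); values assumed as stated above.
JOINT = {(0, 0, 0): 2 / 6, (0, 1, 0): 1 / 6, (1, 0, 1): 1 / 6, (1, 1, 1): 2 / 6}

def bernoulli(q):
    """Distribution (P(S=0), P(S=1)) of a binary variable with P(S=1) = q."""
    return (1.0 - q, q)

def kl_bits(q, p):
    """D(q || p) in bits with the convention 0 * log 0 = 0."""
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def prob_s1_given(index, value):
    """P(S = 1 | X_index = value), where index 1 or 2 selects X1 or X2."""
    matching = {o: pr for o, pr in JOINT.items() if o[index] == value}
    return sum(pr for o, pr in matching.items() if o[0] == 1) / sum(matching.values())

def closest_in_hull(prior_s1, posteriors_s1):
    """Clamp the prior to the interval spanned by the posteriors (binary S only)."""
    return min(max(prior_s1, min(posteriors_s1)), max(posteriors_s1))

PRIOR_S1 = sum(pr for o, pr in JOINT.items() if o[0] == 1)
PRIOR = bernoulli(PRIOR_S1)

# Marginal p(x1, x2).
p_x1x2 = defaultdict(float)
for (s, x1, x2), pr in JOINT.items():
    p_x1x2[(x1, x2)] += pr

# Pointwise shared posterior p_{x1|x2}(S) for every joint outcome (x1, x2).
shared = {
    (x1, x2): bernoulli(
        closest_in_hull(PRIOR_S1, [prob_s1_given(1, x1), prob_s1_given(2, x2)])
    )
    for (x1, x2) in p_x1x2
}

si_kl = sum(pr * kl_bits(shared[x], PRIOR) for x, pr in p_x1x2.items())
si_lr = sum(
    pr * math.log2(shared[(x1, x2)][s] / PRIOR[s]) for (s, x1, x2), pr in JOINT.items()
)
mi_s_x1 = sum(
    pr * kl_bits(bernoulli(prob_s1_given(1, x1)), PRIOR)
    for (x1, x2), pr in p_x1x2.items()
)

print(f"I(S:X1) = {mi_s_x1:.4f}, SI_KL = {si_kl:.4f}, SI_lr = {si_lr:.4f}")
```

Under these assumptions $SI_{KL}$ comes out below $I(S : X_1)$ and $SI_{lr}$ above it, which matches the two directions in which (7) fails.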



[Figure 4: the points $p(S|X_2=0)$, $p(S|X_1=0)$, $p(S)$, $p(S|X_1=1)$ and $p(S|X_2=1)$,
in this order, on the interval between $\delta_{S=0}$ and $\delta_{S=1}$.]

Fig. 4. The construction of $p_{x_1|x_2}$ for the example to $SI_{KL}$ and $SI_{lr}$. The set of
probability distributions of the binary variable $S$ is the interval between the two point
measures $\delta_{S=0}$ and $\delta_{S=1}$. The convex hull of $p(S|X_1 = 0)$ and $p(S|X_2 = 0)$ is marked
in green. The closest point to the prior is $p(S|X_1 = 0)$. The convex hull of $p(S|X_1 = 0)$
and $p(S|X_2 = 1)$ is marked in red; it contains the prior.



The geometric strategy pursued in this section can be compared with the
strategy by Williams and Beer in [3] that leads to the definition of $I_{\min}$. The
formula

$$I_{\min}(S : A_1; \dots; A_k) = \sum_s p(s) \min_i \sum_{a_i} p(a_i|s) \log \frac{p(a_i|s)}{p(a_i)} = \sum_s p(s) \min_i D(p(A_i|s) \,\|\, p(A_i))$$

defining $I_{\min}(S : A_1; \dots; A_k)$ is similar to the defining equation of $SI_{KL}$, but
involves the conditional distributions $p(a_i|s)$ of the input given the output $S$. In
our opinion it is much more natural to work with distributions over the output
variable $S$, since, after all, we are interested in information about $S$. Of course,
the defining equation of $I_{\min}$ can be rewritten in the form

$$I_{\min}(S : A_1; \dots; A_k) = \sum_s p(s) \min_i \sum_{a_i} p(a_i|s) \log \frac{p(s|a_i)}{p(s)},$$

which resembles the definition of $SI_{lr}$, but involves minimizing over the inputs.
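That the two forms of $I_{\min}$ agree can be checked numerically, since $p(a_i|s)/p(a_i) = p(s|a_i)/p(s)$. The following sketch uses hypothetical function names and, as an assumed test case, the joint distribution $p(0,0,0) = p(1,1,1) = 2/6$, $p(0,1,0) = p(1,0,1) = 1/6$ over outcomes $(s, a_1, a_2)$.

```python
import math

# Assumed test distribution over outcomes (s, a1, a2); here S = A2.
JOINT = {(0, 0, 0): 2 / 6, (0, 1, 0): 1 / 6, (1, 0, 1): 1 / 6, (1, 1, 1): 2 / 6}

def prob(joint, fixed):
    """Probability that an outcome matches `fixed`, a dict {position: value}."""
    return sum(pr for o, pr in joint.items() if all(o[i] == v for i, v in fixed.items()))

def i_min(joint, n_inputs, posterior_form=False):
    """I_min(S : A_1; ...; A_n) in bits, with S at position 0 of each outcome.

    posterior_form=False uses log p(a_i|s)/p(a_i);
    posterior_form=True uses log p(s|a_i)/p(s).
    """
    total = 0.0
    for s in sorted({o[0] for o in joint}):
        p_s = prob(joint, {0: s})
        specific = []
        for i in range(1, n_inputs + 1):
            acc = 0.0
            for a in sorted({o[i] for o in joint}):
                p_sa = prob(joint, {0: s, i: a})
                if p_sa == 0:
                    continue  # terms with p(a_i|s) = 0 contribute nothing
                p_a = prob(joint, {i: a})
                if posterior_form:
                    ratio = (p_sa / p_a) / p_s
                else:
                    ratio = (p_sa / p_s) / p_a
                acc += (p_sa / p_s) * math.log2(ratio)
            specific.append(acc)
        total += p_s * min(specific)  # minimum over the inputs, for this s
    return total

value_input_form = i_min(JOINT, 2)
value_posterior_form = i_min(JOINT, 2, posterior_form=True)
```

For this assumed distribution both calls return the same value, which also equals $I(S : A_1)$ because $A_1$ is the less informative input for every $s$.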

The proposed definition of the posteriors $p_{x_1|\dots|x_k}(\cdot)$ involves similar ideas as
the definition of shared information in [5]. We only sketch these connections and
refer to the manuscript [5] for the precise definitions. To distinguish their
function from other functions we call it $I_{HSP}$. The definition of $I_{HSP}(S : X_1; X_2)$
involves approximating the posteriors $p(s|x_1)$ by the convex hull of the family of
posteriors $p(s|x_2)$ for all possible values $x_2$ of $X_2$. However, as defined in [5] this
approximation, denoted by $p_{(x_1 \searrow X_2)}(s)$, is not unique. Then



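$$I_{HSP}(S : X_1; X_2) = \min \Big\{ \sum_{s, x_1} p(s, x_1) \log \frac{p_{(x_1 \searrow X_2)}(s)}{p(s)},\ \sum_{s, x_2} p(s, x_2) \log \frac{p_{(x_2 \searrow X_1)}(s)}{p(s)} \Big\}.$$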

Note that in both definitions, of $p_{(x_1 \searrow X_2)}(s)$ and of $p_{x_1|\dots|x_k}(s)$, the notion of
the convex hull is used as a means to describe the set of distributions that
involve information contained in a set of posterior distributions. The difference
between both approaches is that the authors of [5] do not try to extract and represent
the joint information pointwise, but they try to model the information contained
in $X_1$ using the posterior distributions of $X_2$. This breaks the symmetry, and
therefore, in the end, one has to take a minimum. Furthermore, this definition
is only meaningful in the case of two random variables and violates left
monotonicity (see Section 5).



8 Game Theoretic Intuitions 



Without an operational definition it is hard to decide which of the above
properties and geometric structures are best suited to capture the concept of shared
information. In order to get a better idea of what is actually meant when talking
about shared information, we highlight some aspects from the perspective of
game theory.

Scientists in both game theory [10] and computer science [11] have studied
how knowledge is distributed among a group of agents. Since knowledge can be
regarded as certain information, results from these disciplines can provide
additional insights into shared information. The basic formalism of epistemic agents
considers a set $S$ of possible states of the world or situations. The knowledge
of an agent $i$ is represented as a partition $X_i$ on $S$. Such a partition can be
considered as a function $X_i : S \to \mathcal{X}_i$ mapping states of the world to possible
observations $x_i \in \mathcal{X}_i$ that are available to the agent.¹ Thus, each agent $i$ might
not be able to observe the actual state $s$ of the world, but given an observation $x_i$
he considers all situations in $X_i^{-1}(x_i) = \{ s \in S \mid X_i(s) = x_i \}$ to be possible.

Suppose that agent $i$ observes $x_i \in \mathcal{X}_i$. Then $i$ is said to know an event,
corresponding to a subset $E \subseteq S$, if the event occurs in all situations that the
agent holds possible given $x_i$, i.e.

$$X_i^{-1}(x_i) \subseteq E.$$

This gives rise to the knowledge operators $K_i : 2^S \to 2^S$ taking an event $E$ to
all situations where agent $i$ knows this event:

$$K_i(E) = \{ s \in S \mid \text{agent $i$ knows $E$ given the observation $X_i(s)$} \}. \qquad (8)$$

$K_i(E)$ can itself be considered as an event. Using this operator $K_i$, we can
compute the situations where an event $E$ is shared knowledge between agents
$1, \dots, n$, i.e. where every agent knows $E$:

$$SK(E) = \bigcap_{i=1}^n K_i(E).$$

Note that this does not imply that every agent knows that every agent
knows $E$. The much stronger requirement that everyone knows $E$, and everyone
knows that everyone knows this, and so on, is formalized by iterating the above
construction and referred to as common knowledge:

$$CK(E) = \bigcap_{k=1}^\infty SK^k(E), \quad \text{where } SK^k(E) = SK(\cdots SK(E) \cdots) \ \text{($k$ iterations)}.$$

As an example consider the case of three random variables $X_1$, $X_2$ and $S$,
where $X_1$ and $X_2$ are binary and independent and $S$ consists of a copy of both of
them. Then, the set of possible situations, i.e. the support of the joint distribution
$p(x_1, x_2, s)$, consists of four possible states:

    X_1   X_2    S
     0     0    00
     0     1    01
     1     0    10
     1     1    11

¹ Note the similarity to the definition of a random variable as a measurable map
from a probability space to outcomes. In fact, if we choose an arbitrary probability
distribution on $S$, then the partition $X_i$, considered as a function $S \to \mathcal{X}_i$, becomes
a random variable.

The information partitions correspond to the projections on the respective
components of the joint state, e.g.

$$X_1^{-1}(0) = \{(0,0,00), (0,1,01)\},$$

$$X_2^{-1}(1) = \{(0,1,01), (1,1,11)\}.$$

For the event $E = \{(0,0,00), (0,1,01), (1,0,10)\}$ we find that

$$K_1(E) = \{(0,0,00), (0,1,01)\},$$

$$K_2(E) = \{(0,0,00), (1,0,10)\},$$

and therefore $SK_{1,2}(E) = \{(0,0,00)\}$ since both agents 1 and 2 can exclude
the state $(1,1,11)$ in this case. Thus, we conclude that there exists non-trivial
shared information between $X_1$ and $X_2$, namely that $S \neq 11$, even though $X_1$
and $X_2$ are independent of each other and neither of them knows the state of
the other. On the other hand, there is no common knowledge between $X_1$ and
$X_2$, since $SK_{1,2}(SK_{1,2}(E)) = \emptyset$.
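The knowledge operators of this example can be sketched in a few lines. The code below is an illustration with hypothetical names, not part of the original formalism: states are tuples $(x_1, x_2, s)$, agent $i$ observes the $i$-th component, and common knowledge is computed by iterating $SK$ until it stabilizes, which is valid on a finite state space because $SK(E) \subseteq E$.

```python
from itertools import product

# States of the world (x1, x2, s) for the copy example, with s written as a string.
STATES = frozenset((x1, x2, f"{x1}{x2}") for x1, x2 in product((0, 1), repeat=2))

# Observation functions (hypothetical encoding): agent i sees the i-th component.
OBSERVATIONS = {1: lambda w: w[0], 2: lambda w: w[1]}

def cell(agent, state):
    """All states the agent holds possible when the true state is `state`."""
    observe = OBSERVATIONS[agent]
    return frozenset(w for w in STATES if observe(w) == observe(state))

def knows(agent, event):
    """K_i(E): the states in which the agent's information cell lies inside E."""
    return frozenset(w for w in STATES if cell(agent, w) <= event)

def shared_knowledge(event, agents=(1, 2)):
    """SK(E): the intersection of K_i(E) over all agents."""
    result = STATES
    for agent in agents:
        result = result & knows(agent, event)
    return result

def common_knowledge(event, agents=(1, 2)):
    """CK(E): iterate SK until a fixed point is reached."""
    current = shared_knowledge(event, agents)
    while True:
        following = shared_knowledge(current, agents)
        if following == current:
            return current
        current = following

# The event "S is not 11".
E = STATES - {(1, 1, "11")}
print(sorted(knows(1, E)), sorted(knows(2, E)))
print(sorted(shared_knowledge(E)), sorted(common_knowledge(E)))
```

The sets returned for `knows(1, E)`, `knows(2, E)`, `shared_knowledge(E)` and `common_knowledge(E)` can be compared directly with $K_1(E)$, $K_2(E)$, $SK_{1,2}(E)$ and the empty common knowledge stated above.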

Note that $I_{\min}(S : X_1; X_2) = I_I(S : X_1; X_2) = 1$ bit in this example, if we
assume that $X_1$ and $X_2$ are independent and uniformly distributed. If we say that
$I_{\min}$ measures the shared information, then this implies that $X_1$ and $X_2$ have no
unique information. This is surprising, given that $X_1$ and $X_2$ are independent.
Regarding the game theoretic analysis we see that the shared knowledge only
rules out one state. Thus, a reasonable definition of shared information might
give a positive value to $I_\cap(S : X_1; X_2)$ even if $I(X_1 : X_2) = 0$, but should
certainly stay below 1 bit. Maybe a value of $\log(4/3)$ would be a good idea,
since the number of possibilities is reduced from four to three. Note that (Id2),
as proposed in ^, would require that $I_\cap(S : X_1; X_2) = 0$ whenever $I(X_1 : X_2)$
vanishes.
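The numbers in this paragraph can be reproduced with a short calculation. This is a sketch under the stated assumption that $X_1$ and $X_2$ are independent and uniform, so that the four states of $S = (X_1, X_2)$ are equally likely; the variable names are hypothetical, and logarithms are taken to base 2.

```python
import math
from itertools import product

# S = (X1, X2) with X1, X2 independent and uniform: four equally likely states.
states = list(product((0, 1), repeat=2))
p_state = 1 / len(states)

def specific_information(s, i):
    """I(S = s : X_i) = sum_a p(a|s) log2(p(a|s) / p(a)).

    Since S determines X_i, p(a|s) is 1 for a = s[i] and 0 otherwise,
    so the sum reduces to log2(1 / p(X_i = s[i])).
    """
    p_a = sum(p_state for w in states if w[i] == s[i])
    return math.log2(1.0 / p_a)

# I_min: average over s of the minimum specific information over both inputs.
i_min_copy = sum(
    p_state * min(specific_information(s, i) for i in (0, 1)) for s in states
)

# I(S : X_1): average specific information about the first input.
mi_s_x1 = sum(p_state * specific_information(s, 0) for s in states)

# Value suggested by counting states: four possibilities reduced to three.
counting_value = math.log2(4 / 3)

print(i_min_copy, mi_s_x1, round(counting_value, 4))
```

Here $I_{\min}$ equals the full single-variable information $I(S : X_1)$, which is why no unique information remains, whereas the counting argument gives a value strictly between 0 and 1 bit.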

At present, it is not clear how the difference between shared and common
information could be formulated in information theoretic terms. One may also
ask whether a desired decomposition of information should take into account
shared information or rather refer to common information. It would probably
be easier to use shared information in a decomposition, because otherwise one
needs to decompose the information into terms describing the information that
$X_1$ knows that $X_2$ knows, but $X_2$ does not know whether it is known by $X_1$,
and so on. On the other hand, common knowledge is represented as a partition
(see [10]), and hence corresponds to a random variable after introducing a
probability measure on $S$. In contrast, shared knowledge cannot be represented as a
partition. Maybe this explains why it is difficult, and may even be impossible,
to represent shared information as a random variable.



Note that the condition (Id2) takes into account the mutual information
between elements $A_i$ of the right hand side. Their relationship is not considered
in the definition of shared knowledge, but only appears in the higher-order terms
which are iterated in the case of common knowledge. Therefore, the property
(Id2) is more natural for common information than for shared information. The
same holds true for (LC), since (LC) implies (Id2).

9 Conclusions 

We have discussed natural and intuitive properties that a measure of shared 
information should have. We have shown that some of these properties contradict 
each other. This shows that intuition and heuristic arguments have to be used 
with great care when arguing about information. 

In particular, we discussed the partial information decomposition and lattice 
introduced by Williams and Beer. We have shown that a positive decomposition 
according to the PI lattice contradicts another desirable property, called strong 
symmetry. We are unsure whether this is an argument against strong symmetry, 
or whether the PI lattice has to be refined, since it is difficult to assign plausible 
values to the PI decomposition for the XOR-example. 

Williams and Beer also proposed a concrete measure $I_{\min}$ of shared
information. We show that in some examples this measure yields unreasonably large
values. The problem is that $I_{\min}$ does not distinguish whether different random
variables carry the same information or just the same amount of information.
This phenomenon has also been observed by others. However, most people
focussed on the property that independent variables may share information about
themselves. We argue, using ideas from game theory, that this fact in itself does
not speak against $I_{\min}$; but we agree that the absolute value that $I_{\min}$ assigns
to the shared information is too large. In our opinion, what is more striking
is that $I_{\min}$ is not monotone in its left argument: Random variables share less
information about more.

We expect that further progress requires a more precise, operational idea
of what shared information should be. We believe that our results provide
additional insights, even though we have mainly revealed pitfalls regarding the
notion of shared information. Thus, despite some recent progress, the quest for
a general decomposition of multi-variate information is still open.
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